Tokenize whitespace #1570
base: master
Conversation
Ast/src/Lexer.cpp (outdated)

         if (updatePrevLocation)
             prevLocation = lexeme.location;

         lexeme = readNext();
         updatePrevLocation = false;
-    } while (skipComments && (lexeme.type == Lexeme::Comment || lexeme.type == Lexeme::BlockComment));
+    } while (skipTrivia && (lexeme.type == Lexeme::Comment || lexeme.type == Lexeme::BlockComment || lexeme.type == Lexeme::Whitespace));
Decided to merge skipComments together into a single skipTrivia toggle. However, also open to splitting them into separate toggles if it makes more sense, e.g. skipWhitespace.
Actually, given the test failures and the invasiveness of the change, I ended up splitting skipWhitespace out as a separate toggle. But if the inverse actually makes more sense, we can change it :)
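For illustration, here is a minimal self-contained model of the split-toggle loop described above. The MiniLexer type, readNext(), and the toggle names are simplified stand-ins for the real Lexer.cpp internals, not the actual Luau API:

```cpp
#include <cstdio>
#include <string>
#include <vector>

struct Lexeme
{
    enum Type { Eof, Name, Comment, BlockComment, Whitespace };
    Type type;
    std::string text;
};

struct MiniLexer
{
    std::vector<Lexeme> stream;
    size_t pos = 0;

    // Stand-in for the real readNext(): returns the next raw lexeme.
    Lexeme readNext()
    {
        return pos < stream.size() ? stream[pos++] : Lexeme{Lexeme::Eof, ""};
    }

    // Same structure as the do/while in the diff, but with the comment and
    // whitespace toggles kept separate instead of a single skipTrivia flag.
    Lexeme next(bool skipComments, bool skipWhitespace)
    {
        Lexeme lexeme;
        do
        {
            lexeme = readNext();
        } while ((skipComments && (lexeme.type == Lexeme::Comment || lexeme.type == Lexeme::BlockComment))
                 || (skipWhitespace && lexeme.type == Lexeme::Whitespace));
        return lexeme;
    }
};

int main()
{
    MiniLexer lexer{{{Lexeme::Whitespace, "  "}, {Lexeme::Comment, "-- note"}, {Lexeme::Name, "foo"}}};
    Lexeme t = lexer.next(/*skipComments=*/true, /*skipWhitespace=*/true);
    std::printf("%s\n", t.text.c_str()); // prints "foo"
}
```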
Changes to lexer/parser on the hot path require benchmarking. If you don't have a large codebase to do it on, we can measure it for you, but it will take time.
Some benchmarking after running on a ~1400 KLOC repository. This is 10 runs of
I also decided to run it under hyperfine in case that's a helpful comparison:
If I keep running it, though, the numbers do fluctuate. Let me know if I should do something different to benchmark.
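For reference, a hyperfine comparison along these lines could be used for the before/after measurement; the binary names and target path below are placeholders, not the exact commands used in this PR:

```sh
# Hypothetical before/after comparison; binaries and path are placeholders.
hyperfine --warmup 2 --runs 10 \
  './luau-analyze-master path/to/large-repo' \
  './luau-analyze-whitespace path/to/large-repo'
```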
I don't think it makes sense to preserve specific whitespace given the overhead.
We are already able to reproduce files with their original formatting mostly from just the location information (that's how the Transpiler works).
I guess we are losing some info like mixed tabs/spaces and maybe garbage at the end of lines.
Since whitespace lexemes are not even captured yet, I wonder if you can fill in the blanks in a separate pass based on the location information of the nodes.
Or maybe we don't even need to build that structure, and when we output the whitespace we can go and look at what kind of symbols were used in the original file (even if AST nodes were moved around, it should be reconstructable).
It might also be acceptable in the current form if we optimize the thing that is causing the perf overhead (maybe instead of checking 5 token types we can pack those enum values together and do unsigned(token - unimportant_start) <= (unimportant_last - unimportant_start), or something else).
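For illustration, a minimal sketch of that packed-range check; the enum layout and the isTrivia helper name are hypothetical, not the actual Lexeme::Type values:

```cpp
enum TokenType
{
    Eof,
    Name,
    Number,

    // Trivia kinds placed in one contiguous range so membership can be tested
    // with a single unsigned comparison instead of several equality checks.
    TriviaFirst,
    Comment = TriviaFirst,
    BlockComment,
    Whitespace,
    TriviaLast = Whitespace,
};

inline bool isTrivia(TokenType t)
{
    // unsigned(token - unimportant_start) <= (unimportant_last - unimportant_start)
    // also rejects anything below TriviaFirst, since the subtraction wraps around.
    return unsigned(t - TriviaFirst) <= unsigned(TriviaLast - TriviaFirst);
}
```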
Thanks, this makes sense, will take a look at this. For context, my original thinking was to make the lexer retain all source information so that, given a stream of just lexer tokens, we could recreate the original input. Then, given an AST node and a set of related lexer tokens (which would have to be stored during parsing), we'd be able to recreate exactly how that node was written. Indeed, this would be important so that we could preserve the exact style of whitespace (spaces vs. tabs, leading/trailing whitespace, etc.). But the general idea of "filling in the gaps" with just location info is interesting. I did not explore this much; I'll try playing around with it further.
This would probably be good if the end goal was just to reconstruct the input, but I imagine that if we did not build some sort of structure, the interface to perform code transforms would be a lot harder to work with, especially if you want to modify the whitespace itself. I can see it being doable though.
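As a rough illustration of the "fill in the blanks" idea, here is a sketch that recovers the whitespace between two adjacent tokens by slicing the original buffer. It assumes byte-offset spans for simplicity; the real lexer stores line/column Locations, so a conversion step (not shown) would be needed, and none of these names come from the actual API:

```cpp
#include <cstddef>
#include <cstdio>
#include <string_view>

// Hypothetical byte-offset span for a token in the original source.
struct Span
{
    std::size_t begin; // offset of the first byte of the token
    std::size_t end;   // offset one past the last byte of the token
};

// Returns the raw text (spaces, tabs, newlines) that sat between two adjacent
// tokens in the original source buffer.
std::string_view whitespaceBetween(std::string_view source, Span prev, Span next)
{
    return source.substr(prev.end, next.begin - prev.end);
}

int main()
{
    std::string_view source = "local x  =\t1";
    Span x{6, 7};   // "x"
    Span eq{9, 10}; // "="
    std::string_view gap = whitespaceBetween(source, x, eq);
    std::printf("[%.*s]\n", int(gap.size()), gap.data()); // prints "[  ]"
}
```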
We would like to develop tooling that can consume the syntax tree and perform transformations whilst preserving the underlying structure of the source code. This requires the lexer / parser to retain enough information to be able to re-print the original code.
The first step towards this is to ensure the lexer preserves the input structure. In this commit, we enable tokenization of whitespace, stored in a new Lexeme of type Whitespace. Multiline whitespace is all stored in a single Lexeme.
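For illustration, a sketch of how a short input might tokenize once whitespace lexemes are enabled; the lexeme names below are approximate, not the exact Lexeme::Type spellings:

```cpp
// Input:
//   "local x =  1\n\nreturn x"
//
// Illustrative lexeme stream (names are approximate):
//   ReservedLocal  Whitespace(" ")  Name("x")  Whitespace(" ")  '='
//   Whitespace("  ")  Number("1")  Whitespace("\n\n")  ReservedReturn
//   Whitespace(" ")  Name("x")  Eof
//
// Note that the blank line between the two statements ("\n\n") is carried by
// a single Whitespace lexeme, per the description above.
```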